On the Concept of Snowball Sampling 



Mark S. Handcock* Krista J. Gile†
August 2, 2011 



Confusion over the definition of "snowball sampling" reflects a phenomenon
in the sociology of science: that multi-disciplinary fields tend to produce
a plethora of inconsistent terminology. Often the meaning of a term evolves 
over time, or different terms are used for the same concept. More confus- 
ing is the use of the same term for different concepts. The term "snowball 
sampling" suffers from this treatment. 

The term "snowball sampling" has likely been in informal use for a long
time, but it certainly pre-dates Coleman (1958) and Trow (1957). The earliest
systematic work dates to the 1940s from the Columbia Bureau of Applied
Social Research, led by Paul Lazarsfeld. The Bureau became interested in
the empirical study of personal influence via media (Barton, 2001). This led
to the consideration of interpersonal environments and to the identification
of opinion leaders and followers. However, standard sampling of individuals
was regarded as ineffective in studying the relations between opinion leaders
and followers, as pairs related in this way were seldom both selected in the
sample (Lazarsfeld, Berelson, and Gaudet, 1944, pp. 49-50). To address this,
Robert Merton asked individuals in an initial diverse sample to name the 
people who influenced them. From these, a second wave of influential people
was interviewed as a "snowball sample" (Merton, 1949). This approach
was expanded in a panel survey of women in a Midwestern town in 1945
(Katz and Lazarsfeld, 1955). Barton (2001) provides a history of the work
of the Bureau that is still relevant to today's study of social media.
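
A rough calculation shows why pairs of related individuals are seldom both
selected under standard sampling. The sketch below assumes simple random
sampling without replacement of n individuals from a population of N. That
design is an illustrative assumption, not a description of the Bureau's
actual surveys, and the function name prob_both_sampled is hypothetical.

```python
def prob_both_sampled(n, N):
    """Probability that two given people both appear in a simple random
    sample of size n drawn without replacement from N people.

    Assumption of this sketch: every subset of size n is equally likely.
    """
    if not 0 <= n <= N or N < 2:
        raise ValueError("need 0 <= n <= N and N >= 2")
    # P(first person sampled) = n / N;
    # P(second sampled | first sampled) = (n - 1) / (N - 1).
    return (n / N) * ((n - 1) / (N - 1))


# Example: a sample of 100 people from a town of 10,000.
p = prob_both_sampled(100, 10_000)
```

With n = 100 and N = 10,000, a given leader-follower pair is captured intact
with probability of roughly 1 in 10,000, so a conventional sample of
individuals contains very few such pairs to study.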



*Professor of Statistics, Department of Statistics, University of California, Los Angeles, 
CA 90095-1554 (E-mail: handcock@ucla.edu). 

†Assistant Professor of Statistics, Department of Mathematics and Statistics, University
of Massachusetts, Amherst, MA 01003-9305 (E-mail: gile@math.umass.edu).






Trow's objective was to understand the support for anti-democratic pop- 
ular movements. To do this he conducted an empirical study of the political 
orientations and behaviors of men in Bennington, Vermont in 1954 with 
particular focus on their support for Senator McCarthy. Trow conducted a 
snowball sample over the friendship networks of the men starting from "arbitrarily
chosen lists of employees and occupational groups" (Trow, 1957, p.
297). He is very clear that this does not produce a representative sample, and
goes on to provide a discussion of the issues with network sampling that is
still relevant today (Trow, 1957, pp. 290-295). He surmises: "The resulting
sample, while not meant to be representative of any specific population, nev- 
ertheless includes representatives of all the important occupational groups, 



Following on from these foundations, Coleman, Katz, and Menzel (1957)
used the approach to collect information on influence patterns among physicians.



Coleman (1958) is now the primary reference for the meaning of
snowball sampling. He defines it as: "Snowball sampling: One method of 
interviewing a man's immediate social environment is to use the sociometric 
questions in the interview for sampling purposes." and describes Trow's work 
as the example. 

Acknowledging Coleman (1958), Goodman (1961) introduced "s stage k
name snowball sampling", a specific form of snowball sampling. Goodman's
formulation requires an initial sample drawn using a probability method on 
a known sampling frame. It also fixes parameters of the sampling process: 
the number of links followed from each participant (k) and the number of 
waves of the sample (s). In this work, Goodman develops a rigorous statistical
approach to estimating certain relational features (number of mutual ties,
triangles, etc.) based on the resulting sample. Just as Lazarsfeld et al. (1944)
followed links because they were interested in studying, and therefore sam- 
pling, relationships rather than individuals, Goodman's use of link-tracing is 
motivated by improvements in efficiency allowed by over-sampling relations 
most likely involved in the structures he is studying. 
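
The sampling step of an s stage k name design can be sketched in a few lines.
The code below is a minimal illustration under stated assumptions, not
Goodman's own procedure or his estimators: the initial sample is drawn by
simple random sampling from the frame (one possible probability method), and
the hypothetical dictionary names stands in for the answers to the
sociometric interview questions. The function and argument names are
illustrative.

```python
import random


def snowball_sample(names, frame, n_seeds, s, k, rng):
    """Sketch of an s-stage, k-name snowball sample.

    names:   dict mapping each person to the list of people they name
             (hypothetical stand-in for sociometric interview answers)
    frame:   list of all population members (the known sampling frame)
    n_seeds: size of the initial probability sample
    s, k:    number of waves and number of links followed per participant
    rng:     a random.Random instance, passed in for reproducibility
    """
    # Stage 0: probability sample of seeds from the frame
    # (simple random sampling is an assumption of this sketch).
    waves = [sorted(rng.sample(frame, n_seeds))]
    seen = set(waves[0])
    for _ in range(s):
        next_wave = []
        for person in waves[-1]:
            # Follow at most k named links from each participant.
            for named in names[person][:k]:
                # Only people not yet in the sample join the next wave.
                if named not in seen:
                    seen.add(named)
                    next_wave.append(named)
        waves.append(next_wave)
    return waves


# Toy population: ten people on a ring, each naming the next two.
frame = list(range(10))
names = {i: [(i + 1) % 10, (i + 2) % 10] for i in frame}
waves = snowball_sample(names, frame, n_seeds=2, s=2, k=1,
                        rng=random.Random(0))
```

The result is a list of s + 1 waves, with the seeds first. Because both s and
k are fixed in advance and the seeds come from a probability sample, the
design is fully specified, which is what makes a rigorous statistical
treatment of the resulting sample possible.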

More recently, the term "snowball sampling" has been taken to refer to 
a convenience sampling mechanism with motivation more like that of Trow: 
collecting a sample from a population in which a standard sampling approach 
is either impossible or prohibitively expensive, for the purpose of studying
characteristics of individuals in the population (e.g., Biernacki and Waldorf,
1981). Such settings are often hard-to-reach populations, characterized by the
lack of a serviceable sampling frame. In such cases, an initial probability sample
is either impossible or impractical, such that the initial sample is drawn by
a convenience mechanism, dooming the full sample to non-probability sample 
status. In many such hard-to-reach populations, link-tracing sampling is an 
effective means of collecting data on population members. For this reason, 
this latter non-probabilistic usage of "snowball sampling" is most common in 
practice, although less common in the statistical literature, which favors the 
probabilistic formulations. Note that the seeds (the initial sample members) in
respondent-driven sampling (RDS) can be chosen randomly even in applications
to hard-to-reach populations. For example, they could be selected based on a
spatial sampling frame.

The tension between these two uses of snowball sampling is highlighted in
Thompson (2002), a definitive textbook (p. 183): "The term 'snowball



sampling' has been applied to two types of procedures related to network 
sampling. In one type a few identified members of a rare population 
are asked to identify other members of the population, those so identified are 
asked to identify others, and so on, for the purpose of obtaining a nonprobability
sample or for constructing a frame from which to sample. In the other 
type (Goodman 1961), individuals in the sample are asked to identify other 
individuals, for a fixed number of stages, for the purpose of estimating the 
number of 'mutual relationships' or 'social circles' in the population." Other 
definitions of "snowball sampling" are consistent with this duality in usage
(Snijders, 1992, p. 59).

Respondent-driven sampling (RDS, introduced by Heckathorn and colleagues,
e.g., Heckathorn, 1997) is a newer variant of link-tracing network
sampling, which brings to a head the tension between these two usages. This 
is because RDS is a practical sampling method in hard-to-reach populations, 
beginning with a convenience sample, but aims to approximate a probability 
sample over time. 

RDS is not a variant of either usage of snowball sampling, nor is the reverse
true. Because of the confusion surrounding this term, in Gile and Handcock
(2010) we prefer, and use throughout that paper, the more precise broad category
"link-tracing sampling" while paying homage to the intellectual descent
of the methods from snowball sampling. 

It is precisely the tension between the two usages of snowball sampling 
that makes RDS a fruitful area for ongoing research. RDS pairs the practical 
implementation of a convenience sample with the hope of recovering "something
like" a probability sample. Gile (2008) and Gile and Handcock (2010)
are the first works to systematically evaluate the statistical properties of current
estimators based on RDS data. Gile (2011) proposes a new estimator
that adjusts for the bias introduced by the with-replacement assumption of
these estimators. It is also sometimes possible to adjust for a convenience
sample of seeds. For example, Gile and Handcock (2011) extend the estimator
of Gile (2011) to correct for the bias introduced by seed selection in the
presence of homophily.

The issue here, then, is to recognize the different uses of the term "snowball
sampling". A good solution is for scientists to be as clear as possible in
defining the meaning of terms upon first use in each manuscript. There is 
enough confusion in the various literatures to make this good practice. 



References 

Allen Barton. Paul Lazarsfeld as institutional inventor. International Journal
of Public Opinion Research, 13:245-269, 2001.

Patrick Biernacki and Dan Waldorf. Snowball sampling: Problems and techniques
of chain referral sampling. Sociological Methods and Research, 10:
141-163, 1981.

James S. Coleman. Relational analysis: The study of social organizations 
with survey methods. Human Organization, 17:28-36, 1958. 

James S. Coleman, Elihu Katz, and Herbert Menzel. The diffusion of an innovation
among physicians. Sociometry, 20:253-270, 1957.

Krista J. Gile. Inference from Partially-Observed Network Data. PhD in
Statistics, University of Washington, 2008. 

Krista J. Gile. Improved inference for respondent-driven sampling data with
application to HIV prevalence estimation. Journal of the American Statistical
Association, 106(493):135-146, 2011. doi: 10.1198/jasa.2011.ap09475.

Krista J. Gile and Mark S. Handcock. Respondent-driven sampling: An 
assessment of current methodology. Sociological Methodology, 40:285-327, 
2010. URL http://arxiv.org/abs/0904.1855v1.

Krista J. Gile and Mark S. Handcock. Network model-assisted inference 
from respondent-driven sampling data. ArXiv Preprint, 2011. URL 
http://arxiv.org/abs/XXX.






Leo A. Goodman. Snowball sampling. Annals of Mathematical Statistics, 32: 
148-170, 1961. 

Douglas D. Heckathorn. Respondent-driven sampling: A new approach to 
the study of hidden populations. Social Problems, 44:174-199, 1997. 

Elihu Katz and Paul F. Lazarsfeld. Personal Influence. Free Press, 1955. 

Paul F. Lazarsfeld, Bernard Berelson, and Hazel Gaudet. The People's 
Choice: How the Voter Makes Up His Mind in a Presidential Campaign. 
Duell, Sloan and Pearce, New York, 1944. 

Robert K. Merton. Patterns of influence: A study of interpersonal influence 
and communications behavior in a local community. In Paul F. Lazarsfeld 
and Frank Stanton, editors, Communications Research, 1948-49, pages 
180-219. Harper and Brothers, New York, 1949. 

Thomas A. B. Snijders. Estimation on the basis of snowball samples: how 
to weight. Bulletin de Méthodologie Sociologique, 36:59-70, 1992.

Steven K. Thompson. Sampling. Wiley, Second edition, 2002. 

Martin Trow. Right-Wing Radicalism and Political Intolerance. Arno Press,
New York, 1957. Reprinted 1980. 






